{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The autoreload extension is already loaded. To reload it, use:\n",
      "  %reload_ext autoreload\n"
     ]
    }
   ],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "from exp.nb_07 import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Layerwise Sequential Unit Variance (LSUV)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Getting the MNIST data and a CNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Jump_to lesson 11 video](https://course19.fast.ai/videos/?lesson=11&t=235)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_train,y_train,x_valid,y_valid = get_data()\n",
    "\n",
    "x_train,x_valid = normalize_to(x_train,x_valid)\n",
    "train_ds,valid_ds = Dataset(x_train, y_train),Dataset(x_valid, y_valid)\n",
    "\n",
    "nh,bs = 50,512\n",
    "c = y_train.max().item()+1\n",
    "loss_func = F.cross_entropy\n",
    "\n",
    "data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mnist_view = view_tfm(1,28,28)\n",
    "cbfs = [Recorder,\n",
    "        partial(AvgStatsCallback,accuracy),\n",
    "        CudaCallback,\n",
    "        partial(BatchTransformXCallback, mnist_view)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nfs = [8,16,32,64,64]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ConvLayer(nn.Module):\n",
    "    def __init__(self, ni, nf, ks=3, stride=2, sub=0., **kwargs):\n",
    "        super().__init__()\n",
    "        self.conv = nn.Conv2d(ni, nf, ks, padding=ks//2, stride=stride, bias=True)\n",
    "        self.relu = GeneralRelu(sub=sub, **kwargs)\n",
    "    \n",
    "    def forward(self, x): return self.relu(self.conv(x))\n",
    "    \n",
    "    @property\n",
    "    def bias(self): return -self.relu.sub\n",
    "    @bias.setter\n",
    "    def bias(self,v): self.relu.sub = -v\n",
    "    @property\n",
    "    def weight(self): return self.conv.weight"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn,run = get_learn_run(nfs, data, 0.6, ConvLayer, cbs=cbfs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're going to look at the paper [All You Need is a Good Init](https://arxiv.org/pdf/1511.06422.pdf), which introduces *Layer-wise Sequential Unit-Variance* (*LSUV*). We initialize our neural net with the usual technique, then we pass a batch through the model and check the outputs of the linear and convolutional layers. We can then rescale the weights according to the actual variance we observe on the activations, and subtract the mean we observe from the initial bias. That way we will have activations that stay normalized.\n",
    "\n",
    "We repeat this process until we are satisfied with the mean/variance we observe.\n",
    "\n",
    "Let's start by looking at a baseline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train: [1.73625, tensor(0.3975, device='cuda:0')]\n",
      "valid: [1.68747265625, tensor(0.5652, device='cuda:0')]\n",
      "train: [0.356792578125, tensor(0.8880, device='cuda:0')]\n",
      "valid: [0.13243565673828125, tensor(0.9588, device='cuda:0')]\n"
     ]
    }
   ],
   "source": [
    "run.fit(2, learn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we recreate our model and we'll try again with LSUV. Hopefully, we'll get better results!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn,run = get_learn_run(nfs, data, 0.6, ConvLayer, cbs=cbfs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Helper function to get one batch of a given dataloader, with the callbacks called to preprocess it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def get_batch(dl, run):\n",
    "    run.xb,run.yb = next(iter(dl))\n",
    "    for cb in run.cbs: cb.set_runner(run)\n",
    "    run('begin_batch')\n",
    "    return run.xb,run.yb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xb,yb = get_batch(data.train_dl, run)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We only want the outputs of convolutional or linear layers. To find them, we need a recursive function. We can use `sum(list, [])` to concatenate the lists the function finds (`sum` applies the + operate between the elements of the list you pass it, beginning with the initial state in the second argument)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def find_modules(m, cond):\n",
    "    if cond(m): return [m]\n",
    "    return sum([find_modules(o,cond) for o in m.children()], [])\n",
    "\n",
    "def is_lin_layer(l):\n",
    "    lin_layers = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear, nn.ReLU)\n",
    "    return isinstance(l, lin_layers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mods = find_modules(learn.model, lambda o: isinstance(o,ConvLayer))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[ConvLayer(\n",
       "   (conv): Conv2d(1, 8, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2))\n",
       "   (relu): GeneralRelu()\n",
       " ), ConvLayer(\n",
       "   (conv): Conv2d(8, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))\n",
       "   (relu): GeneralRelu()\n",
       " ), ConvLayer(\n",
       "   (conv): Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))\n",
       "   (relu): GeneralRelu()\n",
       " ), ConvLayer(\n",
       "   (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))\n",
       "   (relu): GeneralRelu()\n",
       " ), ConvLayer(\n",
       "   (conv): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))\n",
       "   (relu): GeneralRelu()\n",
       " )]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a helper function to grab the mean and std of the output of a hooked layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def append_stat(hook, mod, inp, outp):\n",
    "    d = outp.data\n",
    "    hook.mean,hook.std = d.mean().item(),d.std().item()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mdl = learn.model.cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So now we can look at the mean and std of the conv layers of our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.3813672363758087 0.6907835006713867\n",
      "0.3570525348186493 0.651114284992218\n",
      "0.28284627199172974 0.5356632471084595\n",
      "0.2487572282552719 0.42617663741111755\n",
      "0.15965904295444489 0.2474386990070343\n"
     ]
    }
   ],
   "source": [
    "with Hooks(mods, append_stat) as hooks:\n",
    "    mdl(xb)\n",
    "    for hook in hooks: print(hook.mean,hook.std)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first adjust the bias terms to make the means 0, then we adjust the standard deviations to make the stds 1 (with a threshold of 1e-3). The `mdl(xb) is not None` clause is just there to pass `xb` through `mdl` and compute all the activations so that the hooks get updated. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#export\n",
    "def lsuv_module(m, xb):\n",
    "    h = Hook(m, append_stat)\n",
    "\n",
    "    while mdl(xb) is not None and abs(h.mean)  > 1e-3: m.bias -= h.mean\n",
    "    while mdl(xb) is not None and abs(h.std-1) > 1e-3: m.weight.data /= h.std\n",
    "\n",
    "    h.remove()\n",
    "    return h.mean,h.std"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We execute that initialization on all the conv layers in order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0.17071205377578735, 1.0)\n",
      "(0.08888687938451767, 1.0000001192092896)\n",
      "(0.1499888300895691, 0.9999999403953552)\n",
      "(0.15749432146549225, 1.0)\n",
      "(0.3106708824634552, 1.0)\n"
     ]
    }
   ],
   "source": [
    "for m in mods: print(lsuv_module(m, xb))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the mean doesn't exactly stay at 0. since we change the standard deviation after by scaling the weight."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then training is beginning on better grounds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train: [0.42438078125, tensor(0.8629, device='cuda:0')]\n",
      "valid: [0.14604696044921875, tensor(0.9548, device='cuda:0')]\n",
      "train: [0.128675537109375, tensor(0.9608, device='cuda:0')]\n",
      "valid: [0.09168212280273437, tensor(0.9733, device='cuda:0')]\n",
      "CPU times: user 4.09 s, sys: 504 ms, total: 4.6 s\n",
      "Wall time: 4.61 s\n"
     ]
    }
   ],
   "source": [
    "%time run.fit(2, learn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LSUV is particularly useful for more complex and deeper architectures that are hard to initialize to get unit variance at the last layer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Converted 07a_lsuv.ipynb to exp/nb_07a.py\r\n"
     ]
    }
   ],
   "source": [
    "!python notebook2script.py 07a_lsuv.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
